Abstract

We address the problem of automatic generation of features for value function 
approximation. Bellman Error Basis Functions (BEBFs) have been shown to im- 
prove the error of policy evaluation with function approximation, with a conver- 
gence rate similar to that of value iteration. We propose a simple, fast and ro- 
bust algorithm based on random projections to generate BEBFs for sparse feature 
spaces. We provide a finite sample analysis of the proposed method, and prove 
that projections logarithmic in the dimension of the original space are enough to 
guarantee contraction in the error. Empirical results demonstrate the strength of 
this method. 

1 Introduction 

The accuracy of parametrized policy evaluation depends on the quality of the features used for
estimating the value function. Hence, feature generation/selection in reinforcement learning (RL)
has received a lot of attention (e.g. [1, 2, 3, 4, 5]). We focus on methods that aim to generate features
in the direction of the Bellman error of the current value estimates (Bellman Error Based, or BEBF,
features). Successive addition of exact BEBFs has been shown to reduce the error of a linear value
estimator at a rate similar to value iteration [6]. Unlike fitted value iteration [7], which works with
a fixed feature set, iterative BEBF generation gradually increases the complexity of the hypothesis
space by adding new features and thus does not diverge, as long as the error in the generation does
not cancel out the contraction effect of the Bellman operator [6].

A number of methods have been introduced in RL to generate features related to the Bellman error, 
with a fair amount of success |5j Q] |4] |6] O, but many of them fail to scale to high dimensional 
state spaces. In this work, we present an algorithm that uses the idea of applying random projec- 
tions specifically in very large and sparse feature spaces. In short, we iteratively project the original 
features into exponentially smaller-dimensional spaces and apply linear regression to temporal dif- 
ferences to approximate BEBFs. We carry out a finite sample analysis that helps determine valid 
sizes of the projections and the number of iterations. Our analysis holds for both finite and continu- 
ous state spaces and is easy to apply with discretized or tile-coded features. 

The proposed method is computationally favourable to many other feature extraction methods in 
high dimensional spaces, in that each iteration takes poly-logarithmic time in the number of dimen- 
sions. While providing guarantees on the reduction of the error, it needs minimal domain knowledge, 
as agnostic random projections are used in the process. 

Our empirical analysis shows how the algorithm can be applied to general tile-coded spaces. Our
results indicate that the proposed method outperforms both gradient type methods and LSTD
with random projections [8]. The algorithm is robust to the choice of parameters and needs minimal
tweaking to work. It runs fast and has small memory complexity.

2 Notations and Background

Throughout this paper, column vectors are represented by lower case bold letters, and matrices are
represented by bold capital letters. $|\cdot|$ denotes the size of a set, and $\mathcal{M}(X)$ is the set of measures
on $X$. $\|\cdot\|_0$ is Donoho's zero "norm", indicating the number of non-zero elements in a vector. $\|\cdot\|$
denotes the $L^2$ norm for vectors and the operator norm for matrices: $\|M\| = \sup_v \|Mv\|/\|v\|$.
The Frobenius norm of a matrix is defined as $\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}$. Also, we denote the
Moore-Penrose pseudo-inverse of a matrix $M$ by $M^\dagger$. The weighted $L^2$ norm is defined as:

$$\|f(x)\|_{\rho(x)} = \sqrt{\int_X |f(x)|^2 \, d\rho(x)}. \tag{1}$$

We focus on spaces that are large, bounded and $k$-sparse. Our state is represented by a vector
$x \in X$ of $D$ features, having $\|x\| \le 1$. We assume that $x$ is $k$-sparse in some known or unknown
basis $\Psi$, implying that $X = \{\Psi z \text{ s.t. } \|z\|_0 \le k \text{ and } \|z\| \le 1\}$. Such spaces occur both naturally
(e.g. image, audio and video signals [9]) and also from most discretization-based methods (e.g.
tile-coding).

2.1 Markov Decision Process and Fast Mixing 

A Markov Decision Process (MDP) $M = (X, A, T, R)$ is defined by a (possibly infinite) set of states
$X$, a set of actions $A$, a transition probability kernel $T : X \times A \to \mathcal{M}(X)$, where $T(\cdot|x, a)$ defines
the distribution of the next state given that action $a$ is taken in state $x$, and a (possibly stochastic) reward
function $R : X \times A \to \mathcal{M}([0, R_{\max}])$. Throughout the paper, we focus on discounted-reward
MDPs, with the discount factor denoted by $\gamma \in [0, 1)$. At discrete time steps, the reinforcement
learning agent chooses an action and receives a reward. The environment then changes to a new
state according to the transition kernel.

A policy is a (possibly stochastic) function from states to actions. The value of a state $x$ for policy
$\pi$, denoted by $V^\pi(x)$, is the expected value of the discounted sum of rewards ($\sum_t \gamma^t r_t$) if the agent
starts in state $x$ and acts according to policy $\pi$. Defining $R(x, \pi(x))$ to be the expected reward at
point $x$ under policy $\pi$, the value function satisfies the Bellman equation:

$$V^\pi(x) = R(x, \pi(x)) + \gamma \int V^\pi(y) \, T(dy \mid x, \pi(x)). \tag{2}$$

There are many methods developed to find the value of a policy (policy evaluation) when the transition
and reward functions are known. Among these there are dynamic programming methods
in which one iteratively applies the Bellman operator [10] to an initial guess of the value
function. The Bellman operator $\mathcal{T}$ on a value estimate $V$ is defined as:

$$\mathcal{T}V(x) = R(x, \pi(x)) + \gamma \int V(y) \, T(dy \mid x, \pi(x)). \tag{3}$$
When the transition and reward models are not known, one can use a finite sample set of transitions
to learn an approximate value function. Least-squares temporal difference learning (LSTD) and its
derivations [11, 12] are among the methods used to learn a value function based on a finite sample.
LSTD type methods are efficient in their use of data, but fail to scale to high dimensional state spaces
due to extensive computational complexity. Using LSTD in spaces induced by random projections
is a way of dealing with such domains [8]. Stochastic gradient descent type methods are also used
for value function approximation in high dimensional state spaces, some with proofs of convergence
in online and offline settings [13]. However, gradient type methods typically have slow convergence
rates and do not make efficient use of the data.

To arrive at a finite sample bound on the error of our algorithm, we assume certain mixing conditions 
on the Markov chain in question. We assume that the Markov chain uniformly quickly forgets its 
past (defined in detail in the appendix). There are many classes of chains that fall into this category 
(see e.g. [14]). Conditions under which a Markov chain uniformly quickly forgets its past are of
major interest and are discussed in the appendix.

2.2 Bellman Error Based Feature Generation



In high-dimensional state spaces, direct estimation of the value function fails to provide good results
with small numbers of sampled transitions. Feature selection/extraction methods have thus been
used to build better approximation spaces for the value functions [1, 2, 3, 4, 5]. Among these, we
focus on methods that aim to generate features in the direction of the Bellman error, defined as:

$$e_V(\cdot) = \mathcal{T}V(\cdot) - V(\cdot). \tag{4}$$

Let $S_n = ((x_t, r_t)_{t=1}^n)$ be a random sample of size $n$, collected on an MDP with a fixed policy.
Given an estimate $V$ of the value function, temporal difference (TD) errors are defined to be:

$$\delta_t = r_t + \gamma V(x_{t+1}) - V(x_t). \tag{5}$$

It is easy to show that the expectation of the temporal difference given a point $x_t$ equals the Bellman
error on that point [10]. TD-errors are thus proxies for estimating the Bellman error.
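
As a rough numerical illustration of the TD-error definition, the sketch below computes TD-errors for a linear value estimate $V(x) = x^T w$ on a sampled trajectory. The function name `td_errors`, the array layout (one row per observation) and the toy numbers are assumptions made for this example only.

```python
import numpy as np

def td_errors(X, r, w, gamma):
    """TD-errors delta_t = r_t + gamma * V(x_{t+1}) - V(x_t) for V(x) = x^T w.

    X has one row per observation x_1..x_n; r holds the rewards r_1..r_n.
    The last observation has no successor, so n - 1 errors are returned.
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    v = X @ np.asarray(w, dtype=float)   # V(x_t) for every t
    return r[:-1] + gamma * v[1:] - v[:-1]

# Toy trajectory (hypothetical data): 3 observations, 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
r = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 2.0])
delta = td_errors(X, r, w, gamma=0.9)
```

Averaging such TD-errors over many transitions that start from the same point would approximate the Bellman error at that point.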

Using temporal differences, Menache et al. [15] introduced two algorithms to adapt basis functions
as features for linear function approximation. Keller et al. [3] applied neighbourhood component
analysis as a dimensionality reduction technique to construct a low dimensional state space based
on the TD-error. In their work, they iteratively add features that would help predict the Bellman error.
Parr et al. [6] later showed that any BEBF extraction method with small angular approximation error
will provably tighten the approximation error in the value function estimate.

Online feature extraction methods have also been studied in the RL literature. Geramifard et al. [1]
have recently introduced the incremental Feature Dependency Discovery (iFDD) as a fast online
algorithm to extract non-linear binary features for linear function approximation. In their work, one
keeps a list of candidate features (non-linear combinations of two active features), and among these
adds the features that correlate the most with the TD-error.

In this work, we propose a method that generates BEBFs using linear regression in a small space
induced by random projection. We first project the state features into a much smaller space and then
regress a hyperplane to the TD-errors. For simplicity, we assume that regardless of the current estimate
of the value function, the Bellman error is always linearly representable in the original feature
space. This seems like a strong assumption, but is true, for example, in virtually any discretized
space, and is also likely to hold in very high dimensional feature spaces.¹



2.3 Random Projections and Inner Product 

It is well known that random projections of appropriate sizes preserve enough information for exact
reconstruction with high probability (see e.g. [16, 17]). This is because random projections are norm
and distance-preserving in many classes of feature spaces [17].

There are several types of random projection matrices that can be used. In this work, we assume that
each entry in a projection $\Phi^{D \times d}$ is an i.i.d. sample from a Gaussian:²

$$\phi_{i,j} \sim \mathcal{N}(0, 1/d). \tag{6}$$

Recently, it has been shown that random projections of appropriate sizes preserve linearity of a target
function on sparse feature spaces. A bound introduced in [18] and later tightened in [19] shows that
if a function is linear in a sparse space, it is almost linear in an exponentially smaller projected space.
An immediate lemma based on Theorem 2 of [19] bounds the bias induced by random projections:

Lemma 1. Let $\Phi^{D \times d}$ be a random projection according to Eqn 6. Let $X$ be a $D$-dimensional
$k$-sparse space. Fix $w \in \mathbb{R}^D$ and $1 > \xi > 0$. Then, with probability $> 1 - \xi$:

$$\forall x \in X : \; |\langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle| \le \epsilon_{prj}^{(\xi)} \|w\| \|x\|, \tag{7}$$

where $\epsilon_{prj}^{(\xi)} = \sqrt{\frac{48k}{d} \log \frac{4D}{\xi}}$.

¹ In more general cases, the analysis has to be done with respect to the projected Bellman error (see e.g. [6]).
We assume linearity of the Bellman error to simplify the derivations.

² The elements of the projection are typically taken to be distributed with $\mathcal{N}(0, 1/D)$, but we scale them by
$\sqrt{D/d}$, so that we avoid scaling the projected values (see e.g. [16]).






Hence, projections of size $O(k \log D)$ preserve the linearity up to an arbitrary constant. Along with
the analysis of the variance of the estimators, this helps us bound the prediction error of the linear
fit in the compressed space.
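
The inner-product preservation behind this claim can be checked numerically. The following is a minimal sketch, assuming a Gaussian projection with i.i.d. $\mathcal{N}(0, 1/d)$ entries and a randomly drawn $k$-sparse unit vector; the sizes `D`, `d`, `k`, the seed and the helper name `random_sparse_unit_vector` are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, k = 5_000, 400, 10   # illustrative sizes, chosen for this example

def random_sparse_unit_vector(D, k, rng):
    """A k-sparse vector in R^D with unit Euclidean norm."""
    x = np.zeros(D)
    idx = rng.choice(D, size=k, replace=False)
    x[idx] = rng.normal(size=k)
    return x / np.linalg.norm(x)

# Projection matrix with i.i.d. N(0, 1/d) entries.
Phi = rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(D, d))

w = rng.normal(size=D)
w /= np.linalg.norm(w)
x = random_sparse_unit_vector(D, k, rng)

exact = w @ x                              # <w, x> in the original space
compressed = (Phi.T @ w) @ (Phi.T @ x)     # <Phi^T w, Phi^T x> in the compressed space
gap = abs(compressed - exact)              # typically on the order of 1/sqrt(d)
```

Re-running with a larger `d` should shrink `gap`, which is the bias side of the trade-off discussed in the analysis.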

3 Compressed Linear BEBFs 

Linear function approximators can be used to estimate the value of a given state. Let $V_m$ be an
estimated value function described in a linear space defined by a feature set $\{\psi_1, \ldots, \psi_m\}$. Parr et al.
[6] show that if we add a new BEBF $\psi_{m+1} = e_{V_m}$ to the feature set, then (under mild assumptions) the
approximation error on the new linear space shrinks by a factor of $\gamma$. They also show that if we can
estimate the Bellman error within a constant angular error, $\cos^{-1}(\gamma)$, the error will still shrink.

Estimating the Bellman error by regressing to temporal differences in high-dimensional sparse
spaces can result in large prediction error. However, as discussed in Lemma 1, random projections
were shown to exponentially reduce the dimension of a sparse feature space, only at the cost
of a controlled constant bias. A variance analysis along with proper mixing conditions can also
bound the estimation error due to the variance in MDP returns. One can thus bound the total prediction
error with a much smaller number of sampled transitions when the regression is applied in the
compressed space.

In light of these results, we propose the Compressed Bellman Error Based Feature Generation al- 
gorithm (CBEBF). To simplify the bias-variance analysis and avoid multiple levels of regression, 
we present here a simplified version of compressed BEBF-based regression, in that new features are 
added to the value function approximator with constant weight 1 (i.e. no regression is applied on the 
generated BEBFs): 



Algorithm 1: Simplified Compressed BEBFs

Input: Sample trajectory $S_n = ((x_t, r_t)_{t=1}^n)$, where $x_t$ is the observation received at time $t$, and $r_t$
is the observed reward; Number of BEBFs: $m$; Projection size schedule: $d_1, d_2, \ldots, d_m$
Output: $w$: the linear coefficient of the value function approximator
$w^{D \times 1} \leftarrow 0$;
for $i \leftarrow 1$ to $m$ do
    Generate random projection $\Phi^{D \times d_i}$ according to Eqn 6;
    Calculate TD-errors: $\delta_t = r_t + \gamma x_{t+1}^T w - x_t^T w$;
    Let $w'^{\, d_i \times 1}$ be the ordinary least-squares parameter using $\Phi^T x_t$ as inputs and $\delta_t$ as outputs;
    Update $w \leftarrow w + \Phi w'$;
end
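
The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `simplified_cbebf` is hypothetical, the feature matrix is dense (a sparse representation would be needed at the scale discussed in the paper), and the last observation is used only as a successor state, so $n - 1$ transitions enter each regression.

```python
import numpy as np

def simplified_cbebf(X, r, gamma, proj_sizes, rng=None):
    """Sketch of Algorithm 1 for a dense (n, D) feature matrix X.

    X[t] is the observation x_t and r[t] the reward r_t.
    Returns the weight vector w in R^D of the linear value estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    n, D = X.shape
    w = np.zeros(D)
    for d in proj_sizes:
        # Random projection with i.i.d. N(0, 1/d) entries.
        Phi = rng.normal(scale=np.sqrt(1.0 / d), size=(D, d))
        # TD-errors under the current estimate V(x) = x^T w.
        v = X @ w
        delta = r[:-1] + gamma * v[1:] - v[:-1]
        # Ordinary least squares from projected features to TD-errors.
        Z = X[:-1] @ Phi
        w_prime, *_ = np.linalg.lstsq(Z, delta, rcond=None)
        # Add the new BEBF with constant weight 1.
        w = w + Phi @ w_prime
    return w

# Usage on synthetic data (hypothetical numbers, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
r_demo = rng.uniform(size=50)
w_demo = simplified_cbebf(X_demo, r_demo, gamma=0.9, proj_sizes=[3, 3, 3], rng=rng)
```

Each iteration only solves a least-squares problem with $d_i$ unknowns, which is where the computational savings come from.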



The optimal number of BEBFs and the schedule of projection sizes need to be determined and are 
subjects of future discussion. But we show in the next section that logarithmic size projections 
should be enough to guarantee the reduction of error in value function prediction at each step. This 
makes the algorithm very attractive when it comes to computational and memory complexity, as the 
regression at each step is only on a small projected feature space. As we discuss in our empirical 
analysis, the algorithm is very fast and robust with respect to the selection of parameters. 

One can view the above algorithm as a model selection procedure that gradually increases the com- 
plexity of the hypothesis space by adding more BEBFs to the feature set. This means that the 
procedure has to be stopped at some point to avoid over-fitting. This is relatively easy to do, as 
one can use a validation set and compare the estimated values against the empirical returns. The 
generation of BEBFs should stop when the validation error starts to rise. 
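
One simple way to implement this stopping rule is sketched below. It assumes the validation error has been recorded after each added BEBF; the function name `stopping_index` and the `patience` parameter (how many non-improving steps to tolerate) are assumptions introduced for this example.

```python
def stopping_index(validation_errors, patience=1):
    """Index of the iterate with the lowest validation error seen before stopping.

    Scanning stops once the error has failed to improve for `patience`
    consecutive iterations. Returns None for an empty sequence.
    """
    best_idx, best_err, waited = None, float("inf"), 0
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_idx, best_err, waited = i, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx

# Hypothetical validation errors after each added BEBF.
keep = stopping_index([5.0, 4.0, 3.0, 3.5, 2.0])
```

A larger `patience` guards against stopping on a single noisy increase, at the cost of a few extra iterations.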

3.1 Finite Sample Analysis

This section provides a finite sample analysis of the proposed algorithm. Parts of the analysis are
not tight and could use further work, but the bound suffices to prove reduction of the error as new
BEBFs are added to the feature set.

The following theorem shows how well we can estimate the Bellman error by regression to the TD-errors
in a compressed space. It highlights the bias-variance trade-off with respect to the choice of
the projection size.

Theorem 2. Let $\Phi^{D \times d}$ be a random projection according to Eqn 6. Let $S_n = ((x_t, r_t)_{t=1}^n)$ be a
sample trajectory collected on an MDP with a fixed policy with stationary distribution $\rho$, in a $D$-dimensional
$k$-sparse feature space. Fix any estimate $V$ of the value function, and the corresponding
TD-errors $\delta_t$'s bounded by $\pm\delta_{\max}$. Assume that the Bellman error is linear in the features with
parameter $w$. For OLS regression, the fitted parameter is $(X\Phi)^\dagger \delta$, where $X$ is the matrix containing $x_t$'s
and $\delta$ is the vector of TD-errors. Assume that $X$ is of rank larger than $d$. There exist constants $c_1, \ldots, c_4$
depending only on the mixing conditions of the chain, such that for any fixed $0 < \xi_1, \ldots, \xi_5 < 1$, with
probability no less than $1 - (\xi_1 + \xi_2 + \xi_3 + \xi_4 + \xi_5)$:

[Equations (8)-(11), the bound on the $\rho$-weighted error between the compressed OLS fit and $e_V(x)$, are illegible in the source.]

where $\epsilon_{prj}^{(\xi_1)}$ is according to Lemma 1, $m_{\max} = \max_{z \in X} \|z^T \Phi\|$ and $\Sigma_\Phi$ is the feature covariance
matrix under measure $\rho$.

Detailed proof is included in the appendix. The sketch of the proof is as follows: Lemma 1 suggests
that if the Bellman error is linear in the original features, the bias due to the projection can be
bounded within a controlled constant error with logarithmic size projections (first line in the bound).
If the Markov chain "forgets" exponentially fast, one can bound the on-measure variance part of the
error by a constant error with similar sizes of sampled transitions [20] (second and third line of the
bound).

Theorem 2 can be further simplified by using concentration bounds on random projections as defined
in Eqn 6. The norm of $\Phi$ can be bounded using the bounds discussed in [21]: we have with
probability $1 - \delta_\Phi$:

$$\|\Phi\| \le \sqrt{D/d} + \sqrt{2\log(2/\delta_\Phi)/d} + 1 \quad \text{and} \quad \|\Phi^\dagger\| \le \left[\sqrt{D/d} - \sqrt{2\log(2/\delta_\Phi)/d} - 1\right]^{-1}.$$

Similarly, when $n > d$, and the observed features are well-distributed, we expect that $\|X\Phi\|$ is
of order $O(\sqrt{n/d})$ and $\|(X\Phi)^\dagger\|$ is of order $O(\sqrt{d/n})$. Also note that the projections are norm-preserving
and thus $m_{\max} \approx 1$. We also have $\|\Sigma_\Phi^{-1}\| \le d$. Assuming that $n > d$, we can rewrite
the bound on the error up to logarithmic terms as:

[Equation (12) is illegible in the source.]

The $1/\sqrt{d}$ term is a part of the bias due to the projection (excess approximation error). The $\sqrt{d/n}$
term is the variance term that shrinks with larger training sets (estimation error). We clearly observe
the trade-off with respect to the compressed dimension $d$. With the assumptions discussed above,
we can see that a projection of size $d = O(k \log D)$ should be enough to guarantee arbitrarily small
bias, as long as $\|w\|$ is small and $n = \tilde{O}(d^3)$ holds.³

³ Our crude analysis assumes that $n = \tilde{O}(d^3)$. We expect that this can be further brought down to $\tilde{O}(d^2)$,
which is $\tilde{O}((k \log D)^2)$ by our choice of $d$.

The following two lemmas complete the proof on the shrinkage of the error in the value function
prediction:

Lemma 3. Let $V^\pi$ be the value function of a policy $\pi$ imposing stationary measure $\rho$, and let $e_V$ be
the Bellman error under policy $\pi$ for an estimate $V$. Given a BEBF $\psi$ satisfying:

$$\|\psi(x) - e_V(x)\|_{\rho(x)} \le \epsilon \|e_V(x)\|_{\rho(x)}, \tag{13}$$

we have that:

$$\|V^\pi(x) - (V(x) + \psi(x))\|_{\rho(x)} \le (\gamma + \epsilon + \epsilon\gamma) \|V^\pi(x) - V(x)\|_{\rho(x)}. \tag{14}$$

Theorem 2 does not state the error in terms of $\|e_V(x)\|_{\rho(x)}$, but rather in terms of the infinity
norm $\delta_{\max}$. We expect a more careful analysis to give us a bound that could benefit directly from
Lemma 3. However, we can still state the following immediate lemma about the contraction in the
error:

Lemma 4. Let $V^\pi$ be the value function of a policy $\pi$ imposing stationary measure $\rho$, and let $e_V$ be
the Bellman error under policy $\pi$ for an estimate $V$. Given a BEBF $\psi$ satisfying:

$$\|\psi(x) - e_V(x)\|_{\rho(x)} \le c, \tag{15}$$

we have that after adding the BEBF to the estimated value, either the error contracts:

$$\|V^\pi(x) - (V(x) + \psi(x))\|_{\rho(x)} < \|V^\pi(x) - V(x)\|_{\rho(x)}, \tag{16}$$

or the error is already small:

$$\|V^\pi(x) - V(x)\|_{\rho(x)} \le \frac{c}{1 - \gamma}. \tag{17}$$

This means that if we can control the error in BEBFs by some small constant, we can shrink the
error up to a factor of that constant.
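
A short sketch of why a dichotomy of this kind holds, assuming (as is standard under the stationary measure) that the Bellman operator $\mathcal{T}$ is a $\gamma$-contraction in the $\rho$-weighted norm and writing $c$ for the bound on the BEBF error:

```latex
\begin{align*}
\|V^\pi - (V + \psi)\|_{\rho}
  &\le \|V^\pi - (V + e_V)\|_{\rho} + \|e_V - \psi\|_{\rho} \\
  &= \|\mathcal{T}V^\pi - \mathcal{T}V\|_{\rho} + \|e_V - \psi\|_{\rho}
     && \text{since } V + e_V = \mathcal{T}V \text{ and } V^\pi = \mathcal{T}V^\pi \\
  &\le \gamma \, \|V^\pi - V\|_{\rho} + c .
\end{align*}
```

If the new error is strictly smaller than $\|V^\pi - V\|_{\rho}$, the error has contracted. Otherwise $\|V^\pi - V\|_{\rho} \le \gamma \|V^\pi - V\|_{\rho} + c$, which rearranges to $\|V^\pi - V\|_{\rho} \le c/(1-\gamma)$.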



4 Empirical Analysis 

We evaluate our method on a challenging domain where the goal of the RL agent is to apply direct
electrical neurostimulation so as to suppress epileptiform behavior in neural tissues. We use a
generative model constructed from real-world data collected on slices of rat brain tissues [22]; the
model is available in the RL-Glue framework. Observations are generated over a 5-dimensional
real-valued state space. The discrete action choice corresponds to selecting the frequency at which
neurostimulation is applied. The model is observed at 5 steps per second. The reward is 0 for
steps when a seizure is occurring at the time of stimulation, 1/41 for when a seizure happens without
stimulation, 40/41 for each stimulation pulse, and 1 otherwise.⁴

One of the challenges of this domain is that it is difficult to know a priori how to construct a good 
state representation. We use tile-coding to convert the continuous variables into a high dimensional 
binary feature space. We encode the policy as a 6th feature, divide each dimension into 6 tiles and
use 10 randomly placed tile grids. That creates $10 \times 6^6 = 466{,}560$ features. Only 10 of these are
non-zero at any point, thus $k = 10$.
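
The feature count can be reproduced with a small tile-coding sketch. This is an illustration under assumptions not stated in the paper: inputs are rescaled to $[0, 1)$ in every dimension, each grid is shifted by a random offset of at most one tile width, and the names `make_tile_coder` and `active_features` are hypothetical.

```python
import numpy as np

def make_tile_coder(n_dims=6, n_tiles=6, n_grids=10, rng=None):
    """Build a tile coder returning the indices of the active binary features."""
    rng = np.random.default_rng() if rng is None else rng
    # One random offset per grid and dimension, at most one tile width.
    offsets = rng.uniform(0.0, 1.0 / n_tiles, size=(n_grids, n_dims))
    grid_size = n_tiles ** n_dims              # cells per grid
    strides = n_tiles ** np.arange(n_dims)     # mixed-radix index weights

    def active_features(x):
        """Indices of the n_grids active features for x in [0, 1)^n_dims."""
        x = np.asarray(x, dtype=float)
        cells = np.floor((x + offsets) * n_tiles).astype(int)
        cells = np.clip(cells, 0, n_tiles - 1)  # keep shifted points inside the grid
        return np.arange(n_grids) * grid_size + cells @ strides

    return active_features, n_grids * grid_size

active, n_features = make_tile_coder(rng=np.random.default_rng(0))
idx = active([0.1, 0.5, 0.9, 0.3, 0.7, 0.0])
```

Exactly one cell per grid is active, so every encoded point has as many non-zero features as there are grids.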

We apply the best clinical fixed rate policy (stimulation is applied at a consistent 1Hz) to collect
our sample set [22]. Since the true value function is not known for this domain, we define
our error in terms of Monte Carlo returns on a separate test set. Given a test set of size $l$, Monte
Carlo returns are defined to be the discounted sum of rewards observed at each point, denoted by
$U(x_i)$. Now for any estimated value function $V$, we define the return prediction error (RP error) to
be $\sqrt{\frac{1}{l} \sum_{i=1}^{l} (U(x_i) - V(x_i))^2}$.
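
A minimal sketch of this evaluation metric follows, assuming the RP error is the root-mean-square difference between Monte Carlo returns and predicted values, and that returns are truncated at the end of the recorded trajectory. The function names and toy numbers are assumptions made for the example.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma):
    """Discounted sum of rewards observed from each point of a trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards: U_t = r_t + gamma * U_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def rp_error(returns, predictions):
    """Root-mean-square difference between returns U(x_i) and estimates V(x_i)."""
    returns = np.asarray(returns, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((returns - predictions) ** 2)))

U = monte_carlo_returns([1.0, 0.0, 1.0], gamma=0.5)
err = rp_error(U, [1.0, 1.0, 1.0])
```

On long trajectories the truncation only affects the last few points noticeably, since later rewards are discounted geometrically.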

In our first experiment, we analyze the RP error as a function of the number of generated BEBFs,
for different selections of the size of projection $d$. We run these experiments with two sample sizes:
500 and 1500. The projection sizes are either 10, 20 or 30. Fixing $d$, we apply many iterations of the
algorithm and observe the RP error on a testing set of size $l = 5000$. To account for the randomness
induced by the projections, we run these experiments 10 times each, and take the average. Figure 1
includes the results under the described setting.

⁴ The choice of the reward model is motivated by medical considerations. See [22].

It can be seen in both plots in Figure 1 that the RP error decreases to some minimum value after
a number of BEBFs are generated, and then the error starts increasing slightly when more BEBFs
are added to the estimate. The increase is due to over-fitting and can be easily avoided by cross-validation.
As stated before, this work does not include any analysis on the optimal number of
iterations. Discussion on the possible methods for such optimization is an interesting avenue of
future work.




[Two panels, for n = 500 and n = 1500; x-axis: Number of BEBFs.]

Figure 1: RP error of CBEBF for different number of projections, under different choices of $d$,
averaged over 10 trials. The dashed lines indicate ±1-STD of the mean.

As expected, the optimal number of BEBFs depends heavily on the size of the projection: the smaller
the projection, the more BEBFs need to be added. It is interesting to note that even though the
minimum happens at different places, the value of the minimum RP error does not vary much as
a function of the projection size. The difference gets even smaller with larger sample sizes. This
means that the method is relatively robust with respect to the choice of $d$. We also observed small
variance in the value of the optimal RP error, further confirming the robustness of the algorithm on
this domain.

There are only a few methods that can be compared against our algorithm due to the high dimensional
feature space. Direct regression on the original space with LSTD type algorithms (regularized
or otherwise) is impossible due to the computational complexity.⁵ We expect most feature selection
methods to perform poorly here, since all the features are of small and equal importance (note the
different type of sparsity we assume in our work). The two main alternatives are randomized feature
extraction (e.g. LSTD with random projections [8]) and online stochastic gradient type methods
(e.g. the GQ($\lambda$) algorithm [13]).

LSTD with random projections (Compressed LSTD, CLSTD), discussed in [8], is a simple algorithm
in which one applies random projections to reduce the dimension of the state space to a manageable
size, and then applies LSTD on the compressed space. We compare the RP error of CLSTD against
our method. Among the gradient type methods, we chose the GQ($\lambda$) algorithm [13], as it was
expected to provide good consistency. However, since the algorithm was very sensitive to the choice
of the learning rate schedule, the initial guess of the weight vector and the $\lambda$ parameter, we failed to
tune it to outperform even CLSTD. The results on the GQ($\lambda$) algorithm are thus excluded from
this section and should be addressed in future work.⁶

For a fair comparison between CBEBF and CLSTD, we assumed the existence of an oracle that
would choose the optimal parameters for these methods.⁷ Therefore, we compare the best RP error
on the testing set as we vary the parameters in question. Figure 2 shows the best RP error of the
algorithms. For CLSTD, the best RP error is chosen among the solutions with varying projection
sizes (extensive search). For CBEBF, we fix the projection size to be 20, and vary the number of
generated BEBFs (iteratively) to find the optimal number of iterations that minimizes the RP error.

⁵ Analysis of sparse linear solvers, such as LSQR [23], is an interesting direction for future work.

⁶ A fair comparison cannot be made with gradient type methods in the absence of a good learning rate
schedule. Typical choices were not enough to provide decent results.

⁷ Note that since there are one or two parameters for these methods, cross-validation should be enough to
choose the optimal parameter, though for simplicity the discussion of that is left out of this work.

Figure 2: RP error of CBEBF vs. CLSTD for different sample sizes,
averaged over 10 trials. The error bars are ±1-STD of the mean.

As seen in Figure 2, our method consistently outperforms CLSTD by a large margin, and the
results are more robust with smaller variance. Comparing with the results presented in Figure 1,
even the over-fitted solutions of CBEBF seem to outperform the best results of CLSTD.

Each run of our algorithm with hundreds of BEBFs takes one or two minutes when working with 
thousands of samples and half a million features. The algorithm can easily scale to run with larger 
sample sizes and higher dimensional spaces, though a comparison cannot be made with CLSTD, 
since CLSTD (with optimal sizes of projection) fails to scale with increasing number of samples 
and dimensions. 

5 Discussion 

In this work, we provide a simple, fast and robust feature extraction algorithm for policy evalua- 
tion in sparse and high dimensional state spaces. Using recent results on the properties of random 
projections, we prove that in sparse spaces, random projections of sizes logarithmic in the original 
dimension are enough to preserve the linearity. Therefore, BEBFs can be generated on compressed 
spaces induced by small random projections. Our finite sample analysis provides guarantees on the 
reduction of error after the addition of the discussed BEBFs. 

Empirical analysis on a high dimensional space with unknown value function structure shows that
CBEBF vastly outperforms LSTD with random projections and easily scales to larger problems. It
is also more consistent in the output and has a much smaller memory complexity. We expect this
behaviour to hold for most common state spaces. However, more empirical analysis should be
done to confirm this hypothesis. Since the focus of this work is on feature extraction with minimal
domain knowledge using agnostic random projections, we avoided the commonly used problem
domains with known structures in the value function (e.g. mountain car [10]).

Compared to other regularization approaches to RL ([24, 25]), our random projection method does
not require complex optimization, and thus is faster and more scalable.

Of course finding the optimal choice of the projection size and the number of iterations is an inter- 
esting subject of future research. We expect the use of cross-validation to suffice for the selection 
of the optimal parameters due to the robustness in the choice of values. A tighter theoretical bound 
might also help provide an analytical closed form answer to these questions. 

Our assumption of the linearity of the Bellman error in the original space might be too strong for
some state spaces. We avoided non-linearity in the original space to simplify the analysis. However,
most of the discussions can be rephrased in terms of the projected Bellman error to provide more
general results (e.g. see [6]).

Appendix



We start with concentration bounds on rapidly mixing Markov processes. These will be used to
bound the variance of approximations built upon the observed values.

6 Concentration Bounds for Mixing Chains 

We give an extension of Bernstein's inequality based on [20].

Let $x_1, \ldots, x_n$ be a time-homogeneous Markov chain with transition kernel $T(\cdot|\cdot)$ taking values in
some measurable space $X$. We shall consider the concentration of the average of the Hidden-Markov
Process

$$(x_1, f(x_1)), \ldots, (x_n, f(x_n)),$$

where $f : X \to [0, b]$ is a fixed measurable function. To arrive at such an inequality, we need a
characterization of how fast $(x_i)$ forgets its past.

For $i > 0$, let $T^i(\cdot|x)$ be the $i$-step transition probability kernel: $T^i(A|x) = \Pr\{x_{i+1} \in A \mid x_1 = x\}$
(for all $A \subset X$ measurable). Define the upper-triangular matrix $\Gamma_n = (\gamma_{ij}) \in \mathbb{R}^{n \times n}$ as follows:

$$\gamma_{ij}^2 = \sup_{(x,y) \in X^2} \|T^{j-i}(\cdot|x) - T^{j-i}(\cdot|y)\|_{TV}, \tag{18}$$

for $1 \le i < j \le n$ and let $\gamma_{ii} = 1$ ($1 \le i \le n$).

Matrix $\Gamma_n$, and its operator norm $\|\Gamma_n\|$ w.r.t. the Euclidean distance, are the measures of dependence
for the random sequence $x_1, x_2, \ldots, x_n$. For example if the $x_i$'s are independent, $\Gamma_n = I$ and
$\|\Gamma_n\| = 1$. In general $\|\Gamma_n\|$, which appears in the forthcoming concentration inequalities for dependent
sequences, can grow with $n$. Since the concentration bounds are homogeneous in $n/\|\Gamma_n\|^2$, a
larger value of $\|\Gamma_n\|$ means a smaller "effective" sample size.

We say that a time-homogeneous Markov chain uniformly quickly forgets its past if

$$\tau = \sup_{n \ge 1} \|\Gamma_n\|^2 < +\infty. \tag{19}$$

Further, $\tau$ is called the forgetting time of the chain. Conditions under which a Markov chain uniformly
quickly forgets its past are of major interest. For further discussion on this, see [14].

The following result from [14] is a trivial corollary of Theorem 2 of Samson [20] (Theorem 2 is
stated for empirical processes and can be considered as a generalization of Talagrand's inequality to
dependent random variables):

Theorem 5. Let $f$ be a measurable function on $X$ whose values lie in $[0, b]$, $(x_i)_{1 \le i \le n}$ be a homogeneous
Markov chain taking values in $X$ and let $\Gamma_n$ be the matrix with elements defined by (18).
Let

$$z = \frac{1}{n} \sum_{i=1}^{n} f(x_i).$$

Then, for every $\epsilon > 0$,

$$P(z - E[z] \ge \epsilon) \le \exp\left(-\frac{\epsilon^2 n}{2b \, \|\Gamma_n\|^2 (E[z] + \epsilon)}\right).$$

The following is an immediate application of the above theorem:

Lemma 6. Let $f$ be a measurable function over $X$ whose values lie in $[0, b]$. Let $\bar{f}$ be the empirical
average of $f$ over the sample collected on the Markov chain. Under proper mixing conditions for the
sample, there exist constants $c_1 > 0$, $c_2 > 1$ which depend only on $T$ such that for any $0 < \xi < 1$,
w.p. $1 - \xi$:

[Equation (20) is illegible in the source.]



7 Proof of The Theorem 2 

Proof of Theorem 2. To begin the proof of the main theorem, first note that we can write the TD-errors as the sum of Bellman errors and some noise term: $\delta_t = e_V(x_t) + \eta_t$. These noise terms form a series of martingale differences, as their expectation is zero given all the history up to that point:

$$E[\eta_t \mid x_1, \dots, x_t, r_1, \dots, r_{t-1}] = 0. \tag{21}$$

We also have that the Bellman error is linear in the features, thus in vector form

$$\delta = Xw + \eta. \tag{22}$$

Using random projections, in the compressed space we have:

$$\delta = (X\Phi)(\Phi^T w) + b + \eta, \tag{23}$$

where $b$ is the vector of bias due to the projection. Let $b_{\max} = \epsilon_{prj}^{(\xi_1)} \|w\|$. We have from Lemma 1 that with probability $1 - \xi_1$, for all $x \in \mathcal{X}$:

$$\left| (x^T\Phi)(\Phi^T w) - e_V(x) \right| = \left| (x^T\Phi)(\Phi^T w) - x^T w \right| \le b_{\max}. \tag{24}$$

Thus, $b$ is element-wise bounded in absolute value by $b_{\max}$ with high probability. The weighted $L^2$ error in regression to the TD-error as compared to the Bellman error will thus be:

$$\begin{aligned}
&\left\| (x^T\Phi)(X\Phi)^\dagger \delta - e_V(x) \right\|_{\rho(x)} \\
&\quad = \left\| (x^T\Phi)(X\Phi)^\dagger \left[ (X\Phi)(\Phi^T w) + b + \eta \right] - e_V(x) \right\|_{\rho(x)} \\
&\quad = \left\| (x^T\Phi)(\Phi^T w) + (x^T\Phi)(X\Phi)^\dagger b + (x^T\Phi)(X\Phi)^\dagger \eta - e_V(x) \right\|_{\rho(x)} \\
&\quad \le \left\| (x^T\Phi)(\Phi^T w) - e_V(x) \right\|_{\rho(x)} + \left\| (x^T\Phi)(X\Phi)^\dagger b \right\|_{\rho(x)} + \left\| (x^T\Phi)(X\Phi)^\dagger \eta \right\|_{\rho(x)} \\
&\quad \le b_{\max} + \left\| (x^T\Phi)(X\Phi)^\dagger b \right\|_{\rho(x)} + \left\| (x^T\Phi)(X\Phi)^\dagger \eta \right\|_{\rho(x)}.
\end{aligned} \tag{25}$$



The second term is the regression to the bias, and the third term is the regression to the noise. We present lemmas that bound these terms. The theorem is proved by the application of Lemma 7 and Lemma 10. □



7.1 Bounding the Regression to Bias Terms 

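Lemma 7. Under the conditions and with probability defined in Theorem 2:

$$\left\| (x^T\Phi)(X\Phi)^\dagger b \right\|_{\rho(x)} \le b_{\max} \left( 2 + m_{\max} \left\| (X\Phi)^\dagger \right\| \left( 2\sqrt{\frac{n}{d}} + \sqrt[4]{c_3 \, n d \log \frac{c_4 \sqrt{d}}{\xi}} \right) \right). \tag{26}$$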



Proof. Define $w_X = (X\Phi)^\dagger b$. Also define $\|\cdot\|_n$ to be the weighted $L^2$ norm uniform on the sample set $X$:

$$\|f(x)\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) \right)^2. \tag{27}$$

We start by bounding the empirical norm $\|(x^T\Phi) w_X\|_n$. Given that $(X\Phi) w_X$ is the OLS regression on the observed points, its sum of squared errors should not be greater than that of any other linear regression, including the vector 0, thus $\|(x^T\Phi) w_X - b(x)\|_n \le \|b(x)\|_n$. We get:

$$\|(x^T\Phi) w_X\|_n \le \|(x^T\Phi) w_X - b(x)\|_n + \|b(x)\|_n \le 2 \|b(x)\|_n \le 2 b_{\max}. \tag{28}$$

Let $W = \{u \in \mathbb{R}^d \text{ s.t. } \|u\| \le 1\}$. Let $S \subset W$ be an $\epsilon$-grid cover of $W$:

$$\forall v \in W \;\; \exists u \in S : \|u - v\| \le \epsilon. \tag{29}$$

It is easy to prove (see e.g. Chapter 13 of [26]) that these conditions can be satisfied by choosing a grid of size $|S| \le (3/\epsilon)^d$ ($S$ fills up the space within $\epsilon$ distance). Applying the union bound to Lemma 6 (let $f(x) = ((x^T\Phi)u)^2$) for all elements in $S$, we get with probability no less than $1 - \xi$:

$$\forall u \in S : \|(x^T\Phi)u\|_{\rho(x)}^2 \le \|(x^T\Phi)u\|_n^2 + m_{\max}^2 \sqrt{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}}. \tag{30}$$



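Let $\bar{w}_X = w_X / \|w_X\|$. For any $X$, since $\bar{w}_X \in W$, there exists $w'' \in S$ such that $\|\bar{w}_X - w''\| \le \epsilon$. Therefore, under event (30) we have:

$$\begin{aligned}
\|(x^T\Phi) w_X\|_{\rho(x)} &= \|w_X\| \, \|(x^T\Phi) \bar{w}_X\|_{\rho(x)} && (31) \\
&\le \|w_X\| \left( \|(x^T\Phi)(\bar{w}_X - w'')\|_{\rho(x)} + \|(x^T\Phi) w''\|_{\rho(x)} \right) && (32) \\
&\le \|w_X\| \left( m_{\max} \|\bar{w}_X - w''\| + \|(x^T\Phi) w''\|_n + m_{\max} \sqrt[4]{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}} \right) && (33) \\
&\le \|w_X\| \left( m_{\max} \epsilon + \|(x^T\Phi)(w'' - \bar{w}_X)\|_n + \|(x^T\Phi) \bar{w}_X\|_n + m_{\max} \sqrt[4]{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}} \right) && (34) \\
&\le \|w_X\| \left( m_{\max} \epsilon + m_{\max} \epsilon + \|(x^T\Phi) \bar{w}_X\|_n + m_{\max} \sqrt[4]{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}} \right) && (35) \\
&\le \|(x^T\Phi) w_X\|_n + m_{\max} \|w_X\| \left( 2\epsilon + \sqrt[4]{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}} \right). && (36)
\end{aligned}$$
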

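Line 33 uses Equation 30, and we use Equation 29 in lines 34 and 35. Using the definition, we have that $\|w_X\| \le \|(X\Phi)^\dagger\| \, b_{\max} \sqrt{n}$. Thus, using Equation 28 we get:

$$\|(x^T\Phi) w_X\|_{\rho(x)} \le 2 b_{\max} + m_{\max} \|(X\Phi)^\dagger\| \, b_{\max} \sqrt{n} \left( 2\epsilon + \sqrt[4]{\frac{c_1}{n} \log \frac{c_2 |S|}{\xi}} \right). \tag{37}$$

Setting $\epsilon = 1/\sqrt{d}$ and substituting $|S|$ we get:

$$\|(x^T\Phi) w_X\|_{\rho(x)} \le b_{\max} \left( 2 + m_{\max} \|(X\Phi)^\dagger\| \left( 2\sqrt{\frac{n}{d}} + \sqrt[4]{c_1 n \log \frac{c_2 (3/\epsilon)^d}{\xi}} \right) \right), \tag{38}$$

which proves the lemma after simplification. □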
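
The first step of the proof above (Equation 28) relies on the fact that an OLS fit never has a larger empirical error than the zero vector. The following sketch checks this on a toy one-dimensional regression; the data values and helper names are hypothetical and serve only to illustrate the inequality.

```python
import math


def emp_norm(values):
    """Empirical norm: square root of the mean of squares over the sample."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def ols_1d(z, b):
    """Least-squares weight for a single feature without intercept."""
    return sum(zi * bi for zi, bi in zip(z, b)) / sum(zi * zi for zi in z)


def check_ols_bound(z, b):
    """Return (norm of residual, norm of fit, norm of target)."""
    w = ols_1d(z, b)
    fit = [zi * w for zi in z]
    resid = [fi - bi for fi, bi in zip(fit, b)]
    return emp_norm(resid), emp_norm(fit), emp_norm(b)


if __name__ == "__main__":
    z = [0.5, -1.0, 0.25, 0.8, -0.3]     # hypothetical compressed feature
    b = [0.1, -0.05, 0.08, -0.1, 0.02]   # hypothetical bias values
    resid_norm, fit_norm, b_norm = check_ols_bound(z, b)
    # The residual is no larger than the target, so the fit is at most twice it.
    print(resid_norm <= b_norm, fit_norm <= 2 * b_norm)
```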



7.2 Bounding the Regression to Noise Terms 

To bound the regression to the noise, we need the following lemma on martingales: 

Lemma 8. Let $M$ be a matrix of size $l \times n$, in which column $t$ is a function of $x_t$. Then with probability $1 - \xi$ we have:

$$\|M\eta\| \le \delta_{\max} \|M\|_F \sqrt{2 \log \frac{2l}{\xi}}. \tag{39}$$

Proof. The inner product between each row of $M$ and $\eta$ can be bounded by a concentration inequality on martingales, each failing with probability less than $\xi/l$. The lemma follows immediately by adding up the squared inner products. □
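
The bound of Lemma 8 can be sanity-checked numerically. The sketch below is a simplified setting and not the setting of the lemma itself: it assumes a fixed matrix `M` and i.i.d. zero-mean noise uniform on $[-\delta_{\max}, \delta_{\max}]$, which is a special case of bounded martingale differences. All function names and parameter values are illustrative.

```python
import math
import random


def azuma_norm_bound(M, delta_max, xi):
    """Right-hand side of (39): delta_max * ||M||_F * sqrt(2 log(2l / xi))."""
    l = len(M)
    frob = math.sqrt(sum(v * v for row in M for v in row))
    return delta_max * frob * math.sqrt(2.0 * math.log(2.0 * l / xi))


def norm_M_eta(M, eta):
    """Euclidean norm of the matrix-vector product M @ eta."""
    return math.sqrt(sum(sum(m * e for m, e in zip(row, eta)) ** 2 for row in M))


def failure_rate(l=3, n=200, delta_max=1.0, xi=0.05, trials=500, seed=0):
    """Fraction of noise draws for which ||M eta|| exceeds the bound."""
    rng = random.Random(seed)
    M = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(l)]
    bound = azuma_norm_bound(M, delta_max, xi)
    failures = 0
    for _ in range(trials):
        eta = [rng.uniform(-delta_max, delta_max) for _ in range(n)]
        if norm_M_eta(M, eta) > bound:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    print(failure_rate())
```

The observed failure fraction should not exceed $\xi$; in this benign setting the bound is typically far from tight.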



The following lemma based on mixing conditions is also needed to bound the variance term. 

Lemma 9. With the conditions of the theorem, with probability $1 - \xi_4$, there exists a $Y_{d \times d}$ with all the elements in $[-1, 1]$, and thus $\|Y\| \le d$, such that:

$$\frac{1}{n} (X\Phi)^T X\Phi = \Sigma_\Phi + \epsilon_0 Y, \tag{40}$$

where $\epsilon_0 = m_{\max}^2 \sqrt{\frac{c_5}{n} \log \frac{c_6 d^2}{\xi_4}}$.

Stated otherwise, if $Y = \frac{1}{\epsilon_0} \left( \frac{1}{n} (X\Phi)^T X\Phi - \Sigma_\Phi \right)$, then with probability $1 - \xi_4$, $Y$ is element-wise bounded by $\pm 1$.

Proof. This is a simple application of Lemma 6 to all the elements in $\frac{1}{n}(X\Phi)^T X\Phi$ using the union bound, as the expectation is $\Sigma_\Phi$, the chain is mixing and each element of $X\Phi$ is bounded by $m_{\max}$. □



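With the above lemma, we can use the Taylor expansion of matrix inversion to have:

$$\left( (X\Phi)^T X\Phi \right)^{-1} = \frac{1}{n} \left( \Sigma_\Phi + \epsilon_0 Y \right)^{-1} = \frac{1}{n} \left( \Sigma_\Phi^{-1} - \epsilon_0 \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} + O(\epsilon_0^2) \right).$$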
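
The first-order expansion of the matrix inverse can be checked numerically. The sketch below uses a hypothetical $2 \times 2$ matrix `Sigma` and a perturbation `Y` with entries in $[-1, 1]$ (both chosen only for illustration) and measures the largest entry of the difference between the exact inverse and the first-order approximation.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def first_order_inverse_error(eps):
    """Max-entry error of Sigma^-1 - eps * Sigma^-1 Y Sigma^-1 vs. the exact inverse."""
    Sigma = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical covariance-like matrix
    Y = [[1.0, -1.0], [0.5, 1.0]]      # hypothetical perturbation, entries in [-1, 1]
    perturbed = [[Sigma[i][j] + eps * Y[i][j] for j in range(2)] for i in range(2)]
    exact = inv2(perturbed)
    s_inv = inv2(Sigma)
    corr = mul2(mul2(s_inv, Y), s_inv)
    approx = [[s_inv[i][j] - eps * corr[i][j] for j in range(2)] for i in range(2)]
    return max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))


if __name__ == "__main__":
    for eps in (0.04, 0.02, 0.01):
        print(eps, first_order_inverse_error(eps))
```

Halving `eps` should reduce the error by roughly a factor of four, in line with a remainder of order $\epsilon_0^2$.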
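Lemma 10. Under the conditions and probability defined in Theorem 2:

$$\begin{aligned}
&\left\| (x^T\Phi)(X\Phi)^\dagger \eta \right\|_{\rho(x)} && (41) \\
&\quad \le \frac{\delta_{\max} m_{\max}}{n} \left\| \Sigma_\Phi^{-1} \right\| \|X\Phi\| \sqrt{2k \log \frac{2D}{\xi_3}} && (42) \\
&\qquad + \frac{\delta_{\max} m_{\max}^3 d \sqrt{d}}{n} \left\| \Sigma_\Phi^{-1} \right\|^2 \|X\Phi\| \sqrt{\frac{2 c_5}{n} \log \frac{c_6 d^2}{\xi_4} \log \frac{2d}{\xi_5}} && (43) \\
&\qquad + O(n^{-2}). && (44)
\end{aligned}$$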



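Proof. Since $X$ is of rank bigger than $d$, we have $d < n$, and with the use of random projections $X\Phi$ is full rank with probability 1 (see e.g. [18]). We can thus substitute the pseudo-inverse by $[(X\Phi)^T X\Phi]^{-1}(X\Phi)^T$. Using Lemma 9 we get with probability $1 - \xi_4$ for all $x \in \mathcal{X}$:

$$\begin{aligned}
\left\| (x^T\Phi)(X\Phi)^\dagger \eta \right\|_{\rho(x)} &= \left\| (x^T\Phi) \left( (X\Phi)^T X\Phi \right)^{-1} (X\Phi)^T \eta \right\|_{\rho(x)} && (45) \\
&= \left\| \frac{1}{n} (x^T\Phi) \left( \Sigma_\Phi^{-1} - \epsilon_0 \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} + O(\epsilon_0^2) \right) (X\Phi)^T \eta \right\|_{\rho(x)} && (46) \\
&\le \left\| \frac{1}{n} (x^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \eta \right\|_{\rho(x)} + \left\| \frac{\epsilon_0}{n} (x^T\Phi) \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} (X\Phi)^T \eta \right\|_{\rho(x)} + O(n^{-2}). && (47)
\end{aligned}$$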



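To bound the first term, let $e_i$ be the $i$th column of $\Psi$ (see definition of $\mathcal{X}$ in the notation section). Thus $\{e_i\}_{1 \le i \le D}$ is an orthonormal basis under which $x \in \mathcal{X}$ is sparse, and all $e_i$'s are in $\mathcal{X}$. Applying Lemma 8 $D$ times, we get that for all $e_i$, with probability $1 - \xi_3/D$:

$$\begin{aligned}
\left| \frac{1}{n} (e_i^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \eta \right| &\le \delta_{\max} \left\| \frac{1}{n} (e_i^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \right\| \sqrt{2 \log \frac{2D}{\xi_3}} && (48) \\
&\le \frac{\delta_{\max} m_{\max}}{n} \left\| \Sigma_\Phi^{-1} \right\| \|X\Phi\| \sqrt{2 \log \frac{2D}{\xi_3}}. && (49)
\end{aligned}$$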
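The union bound gives us that Line 49 holds simultaneously for all $e_i$'s with probability $1 - \xi_3$. Therefore with probability $1 - \xi_3$ for any $x = \sum_i \alpha_i e_i$:

$$\begin{aligned}
\left| \frac{1}{n} (x^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \eta \right|^2 &\le \left( \sum_{i=1}^{D} |\alpha_i| \left| \frac{1}{n} (e_i^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \eta \right| \right)^2 && (50) \\
&\le \left( \frac{\delta_{\max} m_{\max}}{n} \left\| \Sigma_\Phi^{-1} \right\| \|X\Phi\| \sqrt{2 \log \frac{2D}{\xi_3}} \right)^2 \left( \sum_{i=1}^{D} |\alpha_i| \right)^2. && (51)
\end{aligned}$$

Because $x$ is $k$-sparse, we have that $\sum_{i=1}^{D} |\alpha_i| \le \sqrt{k} \|x\| \le \sqrt{k}$. As the above holds for all $x \in \mathcal{X}$, it holds for the expectation under $\rho$. We thus get:

$$\left\| \frac{1}{n} (x^T\Phi) \Sigma_\Phi^{-1} (X\Phi)^T \eta \right\|_{\rho(x)} \le \frac{\delta_{\max} m_{\max}}{n} \left\| \Sigma_\Phi^{-1} \right\| \|X\Phi\| \sqrt{2k \log \frac{2D}{\xi_3}}. \tag{53}$$
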

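For the second term of Line 47 we first split and then apply Lemma 8:

$$\left| \frac{\epsilon_0}{n} (x^T\Phi) \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} (X\Phi)^T \eta \right| \le \frac{\epsilon_0}{n} \|\Phi^T x\| \left\| \Sigma_\Phi^{-1} \right\|^2 \|Y\| \left\| (X\Phi)^T \eta \right\|. \tag{54}$$

Using Lemma 9 we have with probability $1 - \xi_4$ that $\|Y\| \le d$. Applying Lemma 8 to the $\|(X\Phi)^T \eta\|$ term we get with probability $1 - \xi_5$ for all $x \in \mathcal{X}$:

$$\begin{aligned}
\left| \frac{\epsilon_0}{n} (x^T\Phi) \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} (X\Phi)^T \eta \right| &\le \frac{\epsilon_0}{n} \|\Phi^T x\| \left\| \Sigma_\Phi^{-1} \right\|^2 d \, \delta_{\max} \|X\Phi\|_F \sqrt{2 \log \frac{2d}{\xi_5}} && (55) \\
&\le \frac{\epsilon_0}{n} m_{\max} \left\| \Sigma_\Phi^{-1} \right\|^2 d \, \delta_{\max} \sqrt{d} \, \|X\Phi\| \sqrt{2 \log \frac{2d}{\xi_5}}. && (56)
\end{aligned}$$

As the above holds for all $x \in \mathcal{X}$, it holds for any expectation with measures defined on $\mathcal{X}$:

$$\left\| \frac{\epsilon_0}{n} (x^T\Phi) \Sigma_\Phi^{-1} Y \Sigma_\Phi^{-1} (X\Phi)^T \eta \right\|_{\rho(x)} \le \frac{\epsilon_0 \, \delta_{\max} m_{\max} d \sqrt{d}}{n} \left\| \Sigma_\Phi^{-1} \right\|^2 \|X\Phi\| \sqrt{2 \log \frac{2d}{\xi_5}}. \tag{57}$$

Substituting $\epsilon_0$ of Lemma 9 and using Lines 53 and 57 in Line 47 finishes the proof. □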

8 Proof of Error Contraction Lemmas 

This section will finish the proof of the lemmas presented in the paper. 
8.1 Proof of Lemma 3 

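Proof of Lemma 3. We have that $V^\pi$ is the fixed point of the Bellman operator (i.e. $TV^\pi = V^\pi$), and that the operator is a contraction with respect to the weighted $L^2$ norm on the stationary distribution $\rho$ [27]:

$$\|TV(x) - TV'(x)\|_{\rho(x)} \le \gamma \|V(x) - V'(x)\|_{\rho(x)}. \tag{58}$$

We thus have:

$$\begin{aligned}
&\|V^\pi(x) - (V(x) + \psi(x))\|_{\rho(x)} && (59) \\
&\quad \le \|V^\pi(x) - TV(x)\|_{\rho(x)} + \|(TV(x) - V(x)) - \psi(x)\|_{\rho(x)} && (60) \\
&\quad \le \|TV^\pi(x) - TV(x)\|_{\rho(x)} + \epsilon \|TV(x) - V(x)\|_{\rho(x)} && (61) \\
&\quad \le \gamma \|V^\pi(x) - V(x)\|_{\rho(x)} + \epsilon \left( \|TV(x) - TV^\pi(x)\|_{\rho(x)} + \|V^\pi(x) - V(x)\|_{\rho(x)} \right) && (62) \\
&\quad \le (\gamma + \epsilon\gamma + \epsilon) \|V^\pi(x) - V(x)\|_{\rho(x)}. && (63)
\end{aligned}$$

□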



8.2 Proof of Lemma 4 

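Proof of Lemma 4. We have that:

$$\begin{aligned}
\|V^\pi(x) - V(x)\|_{\rho(x)} &\le \|TV^\pi(x) - TV(x)\|_{\rho(x)} + \|TV(x) - V(x)\|_{\rho(x)} && (64) \\
&\le \gamma \|V^\pi(x) - V(x)\|_{\rho(x)} + \|TV(x) - V(x)\|_{\rho(x)}, && (65)
\end{aligned}$$

and thus:

$$\|V^\pi(x) - V(x)\|_{\rho(x)} \le \frac{1}{1 - \gamma} \|TV(x) - V(x)\|_{\rho(x)}. \tag{66}$$

Let $\epsilon = c / \|e_V(x)\|_{\rho(x)}$. If the contraction does not happen, then due to Lemma 3, we must have:

$$\begin{aligned}
\gamma + \epsilon\gamma + \epsilon \ge 1 \;&\Rightarrow\; \epsilon \ge \frac{1 - \gamma}{1 + \gamma} && (67) \\
&\Rightarrow\; \|TV(x) - V(x)\|_{\rho(x)} \le \frac{1 + \gamma}{1 - \gamma} \, c && (68) \\
&\Rightarrow\; \|V^\pi(x) - V(x)\|_{\rho(x)} \le \frac{1 + \gamma}{(1 - \gamma)^2} \, c. && (69)
\end{aligned}$$

□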
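
The quantities appearing in Lemmas 3 and 4 are simple enough to evaluate directly. The sketch below computes the contraction factor $\gamma + \epsilon\gamma + \epsilon$, the threshold on $\epsilon$ from Line 67, and the resulting value-error bound from Line 69; the function names and the example values of $\gamma$ and $c$ are illustrative assumptions.

```python
def contraction_factor(gamma, eps):
    """Factor (gamma + eps * gamma + eps) from Lemma 3."""
    return gamma + eps * gamma + eps


def eps_threshold(gamma):
    """Largest relative feature error below which the factor stays under 1."""
    return (1.0 - gamma) / (1.0 + gamma)


def value_error_bound(gamma, c):
    """Bound (1 + gamma) / (1 - gamma)^2 * c reached when contraction fails."""
    return (1.0 + gamma) / (1.0 - gamma) ** 2 * c


if __name__ == "__main__":
    gamma = 0.9
    print(contraction_factor(gamma, 0.01))                  # below 1: contraction
    print(contraction_factor(gamma, eps_threshold(gamma)))  # equals 1 at the threshold
    print(value_error_bound(gamma, 0.1))
```

With a discount factor close to 1 the threshold on $\epsilon$ becomes small, so the generated feature must track the Bellman error more closely for the contraction to hold.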